arXiv:1505.06864vl [physics.comp-ph] 26 May 2015 


Decomposition method for block-tridiagonal matrix systems 


P. A. Belov*, E. R. Nugumanov', S. L. Yakovlev 

Department of Computational Physics, Saint-Petersburg State University, Ulyanovskaya I, 198504 Saint-Petersburg, Russia 


Abstract 

The decomposition method which makes the parallel solution of the block-tridiagonal matrix systems possible is 
presented. The performance of the method is analytically estimated based on the number of elementary multiplicative 
operations for its parallel and serial parts. The computational speedup with respect to the conventional sequential 
Thomas algorithm is assessed for various types of the application of the method. It is observed that the maximum 
of the analytical speedup for a given number of blocks on the diagonal is achieved at some finite number of parallel 
processors. The values of the parameters required to reach the maximum computational speedup are obtained. The 
benchmark calculations show a good agreement of analytical estimations of the computational speedup and practically 
achieved results. The application of the method is illustrated by employing the decomposition method to the matrix 
system originated from a boundary value problem for the two-dimensional integro-differential Faddeev equations. The 
block-tridiagonal structure of the matrix arises from the proper discretization scheme including the finite-differences 
over the first coordinate and spline approximation over the second one. The application of the decomposition method 
for parallelization of solving the matrix system reduces the overall time of calculation up to 10 times. 

Keywords: block-tridiagonal matrix, decomposition method, Thomas algorithm, parallel solution, computational 
speedup, three-body systems, Faddeev equations 


1. Introduction 

The finite-difference discretization scheme for the multidimensional second order partial differential equations 
(PDFs) leads to a system with the block-tridiagonal matrix. A precision of the numerical solution of the PDF depends 
on the number of knots over each dimension that usually approches thousands or even tens of thousands. In this 
case, each block of the matrix becomes large and sparse, but the block-tridiagonal structure is kept. A combination 
of the finite-difference method over one coordinate and the basis set methods 111] over other coordinates generally 
produces block-tridiagonal matrix with dense blocks. The intermediate case of the basis set with local support, namely 
splines 0], results in the band blocks with bandwidth comparable with the number of bands appeared from the finite- 
difference method. 

An adaptation of the Gauss elimination |3] to the block-tridiagonal systems is known as the Thomas algorithm 
(TA)li which is also called as the matrix sweeping algorithm j^l- In this algorithm, the idea of Gauss elimination is 
applied to the blocks themselves. A stable and time proved realization of another Gauss elimination based techinque, 
FU decomposition, for general matrices is available in packages like FAPACK 0). The direct application of this 
realization of the FU decomposition is often not feasible because of the large size of the matrix. As a result, the TA 
and special variants of FU decomposition are appropriate techniques for such large problems. 

The TA is robust and quite fast, but serial and thus hardly parallelized. The only available parallelization is at the 
level of operations with matrices. This reason does not allow one to use the algorithm at the modern computational 
facilities efhciently. In order to enable parallel and much faster solution, we have developed the decomposition 
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method (DM) for the block-tridiagonal matrix systems. The idea of the method consists in rearranging the initial 
matrix into equivalent one, namely with the “arrowhead” structure, which allows a parallel solution. The matrix is 
logically reduced to some new diagonal independent blocks, sparse auxiliary ones along the right and bottom sides, 
and a coupling supplementary block. The solution includes parallel inversion of the independent blocks and solving 
the supplementary problem. The DM includes using the TA for dealing with the new blocks of the initial matrix, but 
the parallel structure and possible recursivity of the method lead to remarkable growth of the performance. 

The speedup of the DM depends on the size of the supplementary matrix system. By default, this system is solved 
sequentially by the TA. If its size is relatively small, then the DM gives linear growth of the performance with increase 
of the number of computing units (processors). As the nonparallelized part of the method steadily enlarges, the linear 
growth is slowed down. The speedup with respect to the TA achieves its maximum at some number of computing 
units and then decreases. This issue can specifically be overcome by applying the DM recursively to each independent 
block and to the supplementary problem with the coupling matrix. In this case, the maximum computational speedup 
with respect to the TA can be increased by several times in comparison to the nonrecursive application of the DM. 

The concept of rearranging the initial matrix into “arrowhead” form has already been proposed for ordinary tridi¬ 
agonal matrices. The method comes from the domain decomposition, where the idea to divide a large problem into 
small ones which can be solved independently was introduced, see Ref. 101 and references therein. In Ref. |01 this 
idea is illustrated by the example of tridiagonal matrix obtained from the finite-difference discretization of the one¬ 
dimensional Laplace operator. It is stated there that the matrix of the supplementary problem in this method is the 
smallest one among the similar block methods under consideration. Therefore, the method is supposed to be the most 
efficient. Moreover, it is close to the cyclic reduction method |0] and, as it is shown in Ref. ifioll . is asymptotically 
equivalent to that method. The DM is closely related to the so-called divide-and-conquer method implemented in 
ScaLAPACK for banded matrices d , but designed directly to the block-tridiagonal matrix systems. 

In this paper, we describe the DM for the block-tridiagonal matrix system in detail. We show how one should 
solve the obtained new rearranged system and provide the algorithm for it. We underline in the text that the steps 
of the proposed algorithm can be executed in parallel. The analysis of the number of multiplicative operations for 
the DM is given. We analytically estimate the ratio of the multiplicative operations for the TA to the same quantity 
for different cases of application of the DM. This ratio is directly related to the computational speedup and thus we 
estimate the performance of the DM with respect to the TA. As a validation of the analytical results, we performed 
tests of the DM at the parallel supercomputing facility which confirmed our estimations. 

The block-tridiagonal matrix systems arise in many applied mathematics and physics problems. We are unable 
to enumerate here all possible applications, but briefly mention a few ones related to the quantum few-body systems. 
The three-body scattering problem in configuration space has been firstly reduced to the boundary value problem in 
Ref. II12I] where the numerical technique for the solution has been also described. This pioneering work gave rise to 
a class of papers inheriting the common approach for the numerical solution of the similar problems |T0, 14, 3- 

In the present paper, as an example of the application of the DM, we describe the computational scheme for solution 
of the s-wave Faddeev integro-differential equations and compare its performance with the TA or the DM. 

The paper is organized as follows. In Section 2, we describe in detail the DM and provide the algorithm for its 
implementation. We give a brief description of the conventional TA in Section 3. The analytical estimation of the 
number of multiplicative operations of the DM is presented in Section 4. In Section 5, the computational speedup of 
the DM with respect to the TA for various types of applications is given. The desired parameters of the computational 
system for reaching the maximum speedup are also provided. The validation of the analytical results is described 
in Section 6. In Section 7, we describe a possible application of the DM for solving the i-wave Faddeev integro- 
differential equation and give achieved time reduction. Section 8 summarizes the article by providing the conclusion. 


2. Decomposition method 

The block-tridiagonal matrix system under consideration is described by the equation 

+ QXi + BiXm = Fi, Ai = = 0, (1) 

where A,, B,, Cj, i - 1,...,A are the blocks of the matrix and B, are the blocks of the right-hand side (RHS) 
supervector F. The unknown supervector X is composed of blocks Xj. The size of each block of the matrix is n x n. 
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Figure 1: A graphical scheme of the rearrangement of the matrix system by the decomposition method. Top panel: the initial matrix system with the 
colored separation blocks. Bottom panel: the obtained rean'anged system in the “arrowhead” form which can be solved using parallel calculations. 
The nonzero blocks and vectors are denoted by thick lines, all other elements ai‘e trivial. The new logical blocks and corresponding vectors at each 
panel are denoted by thin lines. 
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whereas the size of each block of the supervectors is « x /, where 1 > 1 is the number of columns. The total size of the 
block-tridiagonal matrix is (nN) x (nN). 

The idea of the DM is presented in Fig. [1] The initial tridiagonal system ([1]) is rearranged into the equivalent, 
“arrowhead” form which allows the parallel solving. The rearrangement is performed in the following way: 

• One chooses a number of subsystems M on which the initial matrix is divided. They are shown in Fig. [1] (top 
panel) by thin squares. 

• M -1 separation block-rows of the whole system are arbitrary marked in rows from the second to the first before 
last. The block-columns corresponding to the diagonal block of each separation block-row are also marked. 

• The marked separation block-rows together with the correspondent block-rows of the supervector in RHS are 
shifted to the bottom of the system by virtue of block-row interchanges. This procedure does not affect the 
elements of the unknown supervectors X. 

• The remained marked separation block-columns are shifted to the right part of the matrix. This movement 
affects the structure of the supervector X in such a way that the separation blocks of this vector are logically 
placed sequentially in the bottom, see Fig.[T](bottom panel). 


As a result, the initial system is logically rearranged into “arrowhead” form. This new structure of the matrix system 
can be represented using the 2 x 2 block-matrix which gives one a better insight into the way of solution. The designed 
system can be written as 



W«\ 

Hj 




( 2 ) 


Here, the unknown solution h corresponds to the moved part of the full solution. The “arrowhead” form provides 
names for the matrix elements of Eq. (I2]i: S comes from a “shaft”, H is the “head” of an arrow, W«,l are the right 
and left “wings” of the arrowhead. This notation is shown in Fig. |2] The square superblock S consists of the new 
independent blocks at the diagonal 

S = diag {5 ^ ... ,5''^}. 

The matrix element H is the bottom right coupling superblock. This element couples the independent blocks 5* and 
consequently independent parts of the solution together to construct the complete one X. Other lateral superblocks 
W«, Wl present additional blocks of the matrix. The solution of the system (|2]i is given by the relations 


( s = S-'F,-S-'Wr/2 

[h = (h - WlS-i Wr)"' [Fh - WrS-'F,) 

The features of the numerical solving the system Q which affect the performance of the DM are following: 


• Due to the structure of obtained superblocks, the inversion of S is reduced to independent inversions of the 
diagonal blocks 5*, k = \,... ,M corresponding to each subsystem. These inversions can be performed in 
parallel. 


Since one has to calculate the products S 'Fj and S *Wr in Eq. (|3]l, it suffices to solve the equations 

k=l,...,M, (4) 


= F'i 


and is not needed to calculate inverse matrices 


1^5 *] ,k - \,... ,M explicitly. 


Once the solutions andz* are obtained, the matrix H-Wz,S ' Wr and the vector F;,-WlS are constructed 

as 

H- 

(5) 




The fact that only two blocks of the matrices and are not trivial drastically reduces the number of matrix 
operations in Eq. Q. 
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• The second line in Eq. Q can be treated as the supplementary matrix equation 

(h - Wz.S-‘W«) h = (f, - WlS-^F,) . (6) 

with the block-tridiagonal matrix. The block-size of this system equals to M— 1 and can be chosen much smaller 
than the size of the initial system ([1]). 

• Once the solution h = (/i\ ..., Y of the supplementary matrix equation (|6l) is obtained, the remaining part 
s of the complete solution ^ is calculated independently for each subsystem as 


where lY - = 0. 


As a result, the initial large matrix system is reduced to a set of independent small subsystems, coupled to the initial 
one by the coupling matrix H. The number of subsystems M should be chosen by a user in order to achieve the 
maximum performance of the DM. 

To solve the independent subsystems in Eq. (|4]i and the supplementary matrix equation (|6]l one can apply any 
appropriate technique. In our paper, for this purpose, we employ the TA as well as the DM itself. The DM calls 
for the mentioned steps of the described algorithm lead us to the recursive application of the method. This recursion 
allows one to considerably improve overall performance of the method because it makes the previously serial part of 
the algorithm to be parallel. In the following sections, we discuss in detail the application of the TA and the DM itself. 


5 ^ 








♦ # # 



Figure 2: The conventional notation of the new logical blocks and corresponding vectors of the rearranged “arrowhead” matrix system. 


3. The Thomas algorithm 

The conventional method for the direct solution of the block-tridiagonal matrix systems is the Thomas algorithm 
(TA) which is also known as the matrix sweeping algorithm 1111. The algorithm is based on the Gauss elimination 
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technique applied to the block-tridiagonal structure of the matrix. It includes two stages: the forward one and the 
backward one. For the matrix equation ([1]), the forward stage consists in the reduction of the initial matrix to the upper 
block-triangular matrix by sequential calculation of the auxiliary blocks 

( ^1 - @\Bi 

\ Jli = giBi, 

and 

/ Si = QiFi 

[Si = giiFi-AiSi-i), i^2,...,N, 

where and = (C, - The backward stage sequentially yields the rows of the solution in the 

backward order using the calculated in the previous stage auxiliary blocks: 

I Xn - Sm 

\ Xi = Si-^iXi^i, 

As a result, the solution X of the matrix system O is obtained. 

It is worth noting that the TA is serial: one can obtain the next auxiliary block or next block of the solution vector 
only from the previous element. Therefore, there is no possibility to make the TA to be parallel except to execute the 
matrix operations parallel themselves at each step. 


4. Number of multiplicative operations 


The analytical estimation of the computational speedup is performed using the information on the number of 
multiplicative operations of the DM. The additive operations are not taken into account because they are much less 
time consuming. As a benchmark for estimation of the computational speedup, we consider the TA for the block- 
tridiagonal system. Since the TA is sequential and easily implemented, it seems to be very convenient for comparison 
of the speedup. 

According to type of the block-operations of the described algorithms, for estimation of the computational speedup 
with respect to TA we have to take into account the computational costs for the matrix operations: multiplication and 
inversion. These matrix operations are already realized in linear algebra packages like BLAS and LAPACK. The 
algorithms for them are well known ini] and the computational costs have been estimated ifl^ . 

The number of multiplicative operations of the TA for calculation of the auxiliary blocks and the solution is given 
in Tab. [T] According to d, we assume that one needs exactly multiplicative operations for multiplication and 
exactly rP multiplicative operations for inversion. The inversion is performed by the LU decomposition which takes 
«^/3 multiplicative operations and then by solving a matrix system with unity matrix in the RHS which, in turn, 
takes 2n^/3 multiplicative operations. The uncertainty of our estimation is 0{n), that can be ignored for blocks of 
the block-tridiagonal matrix of size « > 10. As a result, it is straightforward to obtain that the total number of serial 
multiplicative operations of the TA is jill 


Mota = (3A^ - 2.){n^ + n^l). 

The estimation of the number of multiplicative operations of the DM is more difficult. In our analysis, we will 
firstly consider the DM without recursivity. It means that the solution of the subsysems in Eq. (HI and supplementary 
matrix equation (|6l) is performed by the TA. Later, we will study recursive DM calls for these parts of the method and 
consider consequences of these improvements. 

4.7. Decomposition method without recursivity 

The number of multiplicative operations for each stage of the DM is summarized in Tab. For solving one 
independent subsystem with different vectors in RHS using the TA, at the forward stage one calculates the auxiliary 
blocks for i - 1,... ,7\7t - 1 and for i - . ,Nk- Here Nk is the number of blocks on the diagonal of the k-th 

subsystem. It takes {'iNk — 2)n^ multiplicative operations. These calculations are general for both equations in (IHi. The 
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Calculation of 

Number of multiplicative operations 

@i. 

1 = 1 ,. 

..,N 

(2N- 

- l)w^ 

Jli, i 

= 1,... 

,N- 1 

(N- 

l)n^ 

Si, 

/ = 1 , . 

..,N 

(2N- 

■ l)rP'l 

Xi, 

i = A,. 

..,1 

{N- 

\)rP'l 


Total 


(3N - 2)(n" + n-l) 


Table 1: Number of multiplicative operations for the each pait of the TA as well as the total amount. The computational costs of the elementary 
operations are estimated according to Refs. (SIHIIl. 


Calculation of 

Number of multiplicative operations 
for each k - 1,... ,M 

solution of Eq. 0 

k = 1: (4-Nk - 2)n^ + (3Nk - 2)nH 

\<k<M\ {INk - 4)n^ + {3Nk - 2)nH 
k = M: (6Nk — 4)n^ + (3Nk — 2)n^l 

multiplications 
and in Eq. 0 

k = {1,M): +nH 

\<k<M:An^ + 2nH 

solution of Eq. 0 

(3M - 5)(n^ + n^/) (independently on k) 

multiplications in Eq. 0 

k^{l,M}: NkuH 

1 <k<M: 2NknH 


((4Ai - l)n^ + (4Ai - l)nH)+ 

Total 

[{6Nm - 3)n^ + {4Nm - !)«"/)+ 


+ Ijk =2 {'^Xkn^ + ^NktiH) + {3M - 5)(n^ + nH) 


Table 2: Number of multiplicative operations for the each stage of the DM as well as the total amount. The computational costs of the elementary 
operations are estimated according to Refs. dlHEl. 


auxiliary blocks S, are calculated for each RHS separately. Since the second equation in (|4]i has the complete RHS, it 
takes {2Nk — \)n^l operations to compute all S,. The sparse structure of the RHS of the first equation in (|4]i leads to 
{2Nk - l)n^ or multiplicative operations for the case when the vector in RHS has only the top nonzero block or only 
bottom nonzero block, respectively. At the backward stage, the number of multiplicative operations is {Nk - l)n^ and 
{Nic - l)n^l for the first and second equation in (|4]i. In total, the first subsystem (k = 1) takes {4Nk - 2)n^ + {3Nk - 2)n^/ 
multiplicative operations, the last one {k = M) takes {6Nk - 4)«^ + (3Nk - 2)n^l, the intermediate ones k - 2,... ,M —I 
take {INk - 4)«^ + {3Nk - 2)n^l. 

The calculation of the products and takes and multiplicative operations for 1 < k < M, 

whereas for k — {1,M) the products take only rr' and multiplicative operations. The construction of matrices (|5]l 
includes only additive operations and, therefore, their contribution is ignored. 

The solution of the supplementary matrix system (|6]l of the block size M—\ takes {3M - 5){n^ + n^l) multiplicative 
operations. 

The last stage, namely obtaining the remaining unknown vector 0 from the solution of the supplementary matrix 
system, takes 2Nkn^l and NkH^l multiplicative operations for 1 < k < M and k-{\, M], respectively. 

As a result, the total number of multiplicative operations of the DM is given as 

((4Ai - l)r? + (4Ai - \)nH) + [(6Nm - 3)n^ + (ANm - \)nH)+ 

M-\ 

+ Yj + 5NknH) + {3M - 5){n^ + nH), 

k=2 

where the last term corresponds to operations needed to solve supplementary matrix equation 0 and Nk is the size of 
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k-ih subsystem. It is worth noting that Nk can be different for different k and satisfies the only restriction 

M 

k=\ 


4.2. Recursive call of the decomposition method 

The recursive call of the DM implies using the DM also for solving the supplementary matrix equation ® and the 
independent subsystems (HJi. We will firstly consider calling the DM for solving the supplementary matrix equation 
and secondly for solution of the independent subsystems. After that, the formulas for the case of calling the DM both 
for solution of the independent subsystems and the supplementary matrix equation will be derived. 

4.2.1. Solution of the supplementary matrix equation 

The recursive application of the DM for solving the supplementary matrix equation of size M constructed on 
matrix H ®, leads to the change of the number of multiplicative operations in the general scheme, see Tab. |2 In 
particular, if we divide the supplementary matrix equation into m subsystems, then the last term in Eq. ® becomes 

((4Mi - \)n^ + (4Mi - l)nH) + ((6M,„ - 3)n^ + (4M,„ - \)nH)+ 

+ ^ {iMjn^ + SMjnH) + (3m - 5){n^ + nH), 

i =2 

where Mj is the size of y'-th subsystem of the matrix equation @. If the subsystems are of equal size, then My = 
(M - m)/m = Ml for j - 1,..., m. 

4.2.2. Solution of the independent subsystems 

The solution of the independent subsystems in Eq. © can be done by calling the DM for each such subsystem. 
If we divide ^-th subsystem {k - 1,..., M) into Jk subsubsystems, then for solving Eq. @ for k - \ the number of 
multiplicative operations equals to 

((5Lj - l)n^ + (4L{ - V)nH) + ((SL^^ - 2)n^ + {AL\ - V)n^l) + 

7,-1 

^ [l] {9n^ + 5n^i)] + (4-Ji -6)n^ + (37i - 5) n^l 

/-2 

and for k - M 

((8Lf - 2)n^ + (4Lf - l)n^/) + ((7L" - 3)n^ + (4L" - l)n^l) + 

- /m~ 1 

^ [if (9n^ + 5n^l)] + (67m - 10) + (37m - 5) n^l, 

1=2 

where LI^. is the block-size of the i-th subsubsystem for the k-th subsystem. Eor 1 < k < M the analogous number is 

((9L^ - 2)n^ H- (4L\ - l)n^l) + ((9^^ - 2)n^ + {AL\ - \)nH) + 

7i-l 

[L)[nn^ +5nH)\ + {lJk-n)n^ +OJk-^)nH. 

1=2 

In order to calculate the total number of multiplicative operations one needs to add multiplications for calculations of 
and multiplications in Eq. Q, sum up it over k, and to add multiplications for solution of Eq. ©. 
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n=400, N=3071 



Figure 3: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of parallel processors 
used for computation. The initial pai'ameters of the matrix are following: the size of each block is n = 400, number of blocks on the diagonal is 
N = 3071. The estimations are shown for various numbers of RHS vectors. 


5. Computational speedup 

In order to estimate the computational speedup of the DM with respect to the TA, one needs to study the relation 
of overall times of calculations for DM and for the TA. Since the DM can be executed in parallel on P processors, in 
addition to the time of serial calculations, the overall time includes the largest time of computation among all parallel 
processors. The overall computation time is directly related to the number of total serial multiplicative operations. 
Therefore, to estimate the computational speedup, we evaluate the ratio of the serial multiplicative operations for the 
TA and for the DM 

^ _ Mota 
Modm 

Below, we will firstly study the computational speedup of the DM without recursivity and after that with it. 


5.7. Decomposition method without recursivity 

The standard usage of the DM includes the TA for solution of the equations for independent subsystem as well as 
for supplementary problem constructed with coupling matrix H. Let us consider following cases: 

• The first one is when the number of subsystems equals to the number of parallel processors (M = P) and all 
subsystems are of equal size Nk - (N - P + l)/P for k - 1,..., M. In this case 

Modm = + (3P - 5)(n^ + n^l) 


and the computational speedup S is given by 


S = 


3N-2 


3P 




iP N 


( 10 ) 


So, the computational speedup increases linearly S ~ 3P/1 with respect to number of processes P for P N. 
The computational speedup depends only on the ratio Ijn. This dependence as well as linear growth of the 
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n=400, N=3071 



Figure 4: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of parallel processors 
used for computation. The initial parameters of the matrix are following: the size of each block is n = 400, number of blocks in the diagonal is 
N = 3071. The computational speedup slightly increases if one defines the first and last subsystems are larger than others and, simultaneously, 
equal computation time for each parallel processor. 


speedup are shown in Fig. |3] for ii - 400 and N = 3071. One can see that the speedup slightly increases if 
number of vectors / in RHS approaches n. Moreover, it is clearly seen from the plot, that the speedup decreases 
when the number of processors (and the logical subsystems) becomes large. The reason for this effect is the 
growth of the serial part of the DM, i.e. of the supplementary matrix system. The maximum computational 
speedup is achieved for the number of processors 


P = 


(N+l) 


5 + 


1 + Ijn 


which is close to ^ll{N + l)/3 for I <^n. 


• The second case is when the number of subsystems equals to the number of processors (M = P) and subsystems 
are not of equal size Nk for k - 1,..., M. Since the first and last processors perform less operations than other 
ones, it is optimal to define the equal time for each subsystem by varying Nk for k - 1,..., M. Let us suppose 
the sizes of the first and last subsystems to be increased in a and fi times, respectively. Then, the equal sizes of 
ki\\ subsystems for A: = 2 ,..., M - 1 is 


Nk 


N-(P-\) 
P-l + a+p' 


The computational speedup in this case is defined as 


"3F-5 + (5 + j4)(^^)' 

One can easily derive from Tab.l^that e [{INk + l)/4, {6Nk + l)/4] and Nm g [(7A(t + 3)/6, (12A^,t + 4)/10] 
for I < I < n. Therefore, for A(t » 3 and I « «, one approximately obtaines cr ^ 7/4 and /3 ^ 7/6. Fig. |4] 
shows that the computational speedup for I <s n increases slightly if one defines larger first and last blocks and 
simultaneously equal computation time for each parallel processor. For I ~ n, the similar almost negligible 
difference is kept. 
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• The last case is when the number of subsystems is proportional to the number of processors (M = iP) for i > 2 
and subsystems are of the same size as in the previous case. In this case, each processor is feeded by a queue 
of i subsystems which are executed sequentially. Changing Eq. (fTTT) allows one to estimate the speedup in this 
case: 

^ ^_ 3jV-2 _ 

- 3iP-5 + i{5 + j^){-^E^)' 

The additional factor i in front of brackets in the denominator comes from the fact that the queue for each 
processor consists of i subsystems. The result for M - P, 2P, 3P is shown in Fig. |5] It is clearly seen that the 
maximum speedup is diminished with increase of i. Nevertheless, for small number of processors the linear 
growth of speedup is kept independently of i. For larger P, the serial part of the DM drastically increases, 
especially for larger i. This fact leads to the considerable decrease of the speedup. 

The studied cases bring us to the conception for application of the DM. One should follow the hrst or second con¬ 
sidered case: subsystems should be of equal size or to have equal execution time. For both cases the number of 
subsystems should be equal to number of processors. The equal execution time is achieved by choosing the size of 
the hrst and last subsystems to be about 7/4 and 7 /6 of the size of each other subsystem. Since the difference between 
described cases is almost negligible, if the size of each subsystem is large, the initial matrix can be simply divided to 
equal subsystems. The computational speedup with respect to the sequential TA is shown in Fig. |3] 


n=400, N=3071 



Figure 5: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of parallel processors 
used for computation. The initial parameters of the matrix are following: the size of each block is n = 400, number of blocks in the diagonal is 
N = 3071. The computational speedup is presented for the case when the first and last subsystems are larger than others and, simultaneously, 
equal computation time for each pai'allel processor is defined. Different curves indicate different number of independent subsystems, M, which is 
proportional to number of processors P. 


5.2. Recursive call of the decomposition method 

We will firstly consider calling the DM for solving only the supplementary matrix equation and secondly calling 
it for solution of the independent subsystems and thirdly both for solution of the independent subsystems and the 
supplementary matrix equation. 
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5.2.1. Parallel solution of the supplementary matrix equation 

Based on the previous subsections and particularly Eq. (|9]l, we now additionally assume that all subsystems k - 
1,.. .,M are of equal size Nk - Ni. As we have seen in previous sections, this assumption does not significantly affect 
the performance. If we divide the supplementary matrix equation (|6l) into m independent equal subsystems to apply 
on each of them the DM, then the number of serial multiplicative operations is given as 

Modm = {iNin^ + SNin^l) + {iMin^ + SMpi^l) + (3m - 5)(n^ + n^l), (12) 

where Mj = (M - m)/m - Mi for j - I,..., mis the size of each subsystem. As a result, for M - P, the following 
cases can be considered; 

• The first case is if N and P are kept fixed. Then, the minimum of Eq. (fT2l i (and the maximum of the computa¬ 
tional speedup) is achieved for 


\P(7n^ + 5nH) 
^ 3(n^ + iPl) 



1 +lln 


(13) 


Besides, to apply the DM the condition (P — m)/m > 2 is required. It is satisfied asP>21. If/’<21, then 
Eq.O does not define the maximum of the speedup, instead it is achieved at the boundary m = P/3. As / « n, 
Eq. (IT 3 ]) is reduced to m ~ V7P/3 and the computational speedup S is given as 

^ ^ 3N-2 

l(N+l)IP + 2^l2\P-\9' 

Eig.l^shows the considerable increase of the speedup in case of the recursive call of the DM for the supplemen¬ 
tary matrix equation. 

• The second case is if only N is fixed. This case corresponds to the situation when the number of processors, P, 
is arbitrary and one has to choose the appropriate number to reach maximum speedup. One can find parameters 
of the extremum of Eq. (fT2l i and show that it is a minimum using the matrix of the second derivatives. In this 
case, the minimum of Eq. (fT2l i is achieved for 




1 -H Ijn 


N +\ 


(14) 


Since m < P/3, the parameters (fT4l i give minimum if P < (A H- l)/3, which is usually satisfied. The compu¬ 
tational speedup, calculated for N - 3071, as a function of P and m is represented in Pig.|7] The contour plot 
shows that the unique maximum (red color) of the speedup exists and is achieved by the given formulas (fT4l i. 


To summarize the considered cases, it should be pointed out that the recursive call of the DM for the supplementary 
matrix equation leads to remarkable growth of the performance, see Pig.|6] Using the described cases, one can choose 
the parameters of the algorithm based on its own computational facilities and the parameters of the initial matrix in 
order to achieve the maximum performance, see Pig. Q 


5.2.2. Parallel solution of the independent subsystems 

Now we consider parallelization of the solution of each of the M independent subsystems in Eq. (|4]i. Solving the 
supplementary matrix equation ® remains to be serial. We apply the DM for each equal independent subsystem, 
namely divide each subsystem with Ni blocks on the diagonal into J independent equal subsubsystems. (One can 
see that nonequal subsubsystems lead to the considerable reduce of the performance.) Then, the total number of used 
processors is P = MJ and the number of serial multiplicative operations is given as 

Modm = (1 1«^ + 5nH) N 2 + [n\lJ - 11) -H n^l(3J - 5)) -F 
+An^ + 2n^l + (n^ + n^l){3M - 5) + 2i?lNi, 
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n=400, N=3071 



Number of processors (P) 

Figure 6: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of parallel 
processors, P, needed for solution of the initial matrix. The standard application of the DM and the recursive calls for the supplementary matrix 
system are indicated. The number of parallel processors, m, needed for the supplementary matrix system is given by Eq. (H). 


Computational speedup 



10 20 30 40 50 60 70 80 90 100 

m 


Figure 7: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of parallel 
processors, m, needed for the supplemental^ matrix system and number of processors, P, for solution of the initial matrix by the DM. The 
maximum of computational speedup is shown by red color. The bottom dashed area indicates the domain where the condition rn < P/3 is not 
satisfied. 
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where A^i = {N-M+\)IM and N 2 - {N\-J+1)1 J. In the case when P (P < {N+ 1)/3) is kept fixed, using the method 
of Lagrange multipliers lll9n one can obtain that the maximum speedup is achieved for the values of parameters 


M = 


{hP + 3nH)P + 2nH(N + 1) 


3(n^ + rPl) 



The resulted computational speedup for I <& n is shown in Fig. |9] It is clearly seen that for small P the speedup 
behaves as S ~ 3P/ 11. For larger P, the speedup reaches its maximum and exceeds that obtained when the DM is 
recursively called for the supplementary matrix equation. 

One can also analytically obtain the maximum speedup for the case when P is arbitrary. It is achieved for the 
parameter 

_ (N+1) (lln^+5nH\ 

\ IrP + 3rP-l / ’ 

and J obtained as a real positive solution of the quartic equation 


jU 


Wn^ +5nH 
2iPl 


r -{N + 1) 


3(«^ + «2;)(ii„3 

(7u3 + 3fplf 2fPl 


= 0 . 


(16) 


However, in practice, it is much simpler to solve Eq. (fTfil l numerically if the size of the block, n, number of RHS 
vectors, /, and N are known. An example of reached speedup for the case when n - 400, I - N - 3071 as a 
function of M and J is shown in Fig. [8] The dashed area corresponds to the domain where the condition P<(A+l)/3 
is not hold and the parallel solution of the independent subsystems is not possible. 


Computational speedup 



5 10 15 20 25 30 35 40 45 50 

J 


Figure 8: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the number of subsystems, 
M, the initial matrix is divided into and the number of subsubsystems, 7, each subsystem is divided into. The total number of parallel processors 
involved in computation is P = MJ. The maximum of computational speedup is shown by the dai'k red color. The dashed ai'ea corresponds to the 
domain where the condition P < (A + l)/3 is not hold. 


5.2.3. Parallel solution of both the independent subsystems and the supplementary matrix equation 

If we apply the DM to both the independent subsystems and the supplementary matrix equation, then the number 
of serial multiplicative operations is the combination of Eq. (fT^ and Eq. (fT5] t. Keeping the notation of the previous 
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case and assuming also that P - MJ, the number of serial multiplicative operations is given as 


Modm = (l + 5i?i)N2 + {n^ilJ -11) + nHCiJ - 5)) + 

+4n^ + + {In^ + 5n^l)l -) + (n^ + n^l){3m - 5) + 2n^lN\, 

\ m ) 


where A^i - {N - M + 1)/M, N 2 - {Ni - J + \)l J and m is the number of independent subsystems to which the 
supplementary matrix equation is divided into. 

Keeping P (P < (N + l)/3) to be fixed, one can obtain the maximum computational speedup if the parameters are 


M = 


^3(5 + i+^//«) 


3P + 


AP 

1 +//n 


n + I j 


„ = > (3P. ^(5 

y 3(n^ + rPl) y/g \ 1 + Ijn n + I j \ 1 + //«/ 


and obviously J - PjM. The comparison of the computational speedup for this case with previous cases is shown in 
Fig. |9] It is clearly seen that the computational speedup in this case is larger than in previous cases. Unfortunately, 
this large speedup is achieved only for large total number of involved parallel processors, P, that is practically not 
feasible. 

If the total number of involved processors P is free, then the maximum computational speedup is achieved for 
values of the parameters 


(A?+1 )/11«3+5«2/\ _ jM(lyv’+5nH) 

\ liP + 3nH / ’ y 3(rP + rPl) 


(17) 


and J is the real positive solution of the cubic equation 


+ 


niiP +5n^l\ 

\j2 VA^ + 1 1 

{WiP + 5nH\ 

\ 2rPl ] 

1 (2n^l) ' 

In^ + 3rPl / 


V3(7n3 + 5 „ 2;)(„3 + ^ 0. 


(18) 


In Fig. |9] the parameters of the analytical maximum provided by formulas (I17I18I) are M = 106, J - 7, m - 16. 


6. Validation of the analytical resnlts 

Described analytical estimations have been validated using the SMP system with 64 processors (Intel Xeon CPU 
X7560 2.27GHz) and 2TB of shared operative memory. The Red Hat Enterprise Linux Server 6.2 and GCC 4.4.6 
(20110731) compiler are installed in the system. The DM has been implemented as an independent program written 
in C with calls of the corresponding LAPACK 3.5.0 Fortran subroutines |@1. The parallelization has been done using 
the OpenMP 3.0 1^ . 

In validation studies of the computational speedup (fTOl i only the solution time has been taken into account. This 
time includes time needed for computation itself as well as time needed for possible memory management during 
the solution. The time for generation of the initial blocks is ignored. The initial blocks have been generated by the 
well known discretization of the two dimensional Laplace operator in the Laddeev equations (see Section 6) with 
filling of zero elements by relatively small random values. This procedure guarantees that the generated matrices are 
well-posed. 

We experimentally estimated the computational speedup for the nonrecursive case when each subsystem © has 
an equal size. In Lig. [TOl we took the size of a block to be n = 100, 400 and we show the maximum speedup 
obtained in a series of experiments with fixed number of processors. It should be noted that the achieved speedup 
was equal or less than the analytical one. The averaged value of the experimentally obtained speedups is smaller 
than the shown values by 3% - 5%. This difference may be attributed to the nonideality of the memory management 
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n=400, N=3071 



Number of processors (P) 


Figure 9: The analytical estimation of the computational speedup of the DM with respect to the TA as a function of the overall number of processors, 
P, needed for parallel computation. The four cases described in the text are shown. 


N=3071, 1 RHS vector 



Number of processors 


Figure 10: The analytical estimation (solid line) of the computational speedup of the DM with respect to the TA as a function of the number of 
parallel processors used for computation as well as the speedup obtained in the validation studies (empty squai'es and circles). The parameters of 
the matrix are following: the size of each block is n = 100,400, number of blocks in the diagonal is N = 3071, only one RHS vector is used. 
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of the computational system, additional system processes running simultaneously and affecting our task, and hyper¬ 
threading of the processors. These issues are clearly seen for the case when the number of parallel processors is P = 48 
and especially P - 64. For these cases, the maximum achieved speedup is considerably less than the analytical one. 

Nevertheless, the general trend of the analytical result for the equal subsystems is clearly confirmed in the val¬ 
idation studies. The practical behavior of the computational speedup for other cases can also be observed in the 
computational systems with more processors. 


7. Application 


We apply the DM to the numerical solution of the boundary value problem (BVP) arised from the three-body 
scattering problem. One of the rigorous approaches for treating the three-body scattering problems above the breakup 
threshold is based on the configuration space Faddeev formalism 112ill . It reduces the scattering problem to the BVP by 
implementing appropriate boundary conditions. The boundary conditions have been introduced by S. P. Merkuriev il2ll 
and their new representation has recently been constructed in Ref. lllbll . After discretization of the BVP at some grid 
we come to the matrix equation of interest and then apply the DM. Below, we will describe key points of the neutron- 
deuteron (nd) scattering problem, computational scheme, and provide some results. 


7. 1. Statement of the problem 

The neutron-deuteron system under consideration is described by the differential Faddeev equations 12ill . The s- 
wave equations for the radial part of the Faddeev wave function component appear after projection onto the states with 
zero orbital momentum in all pairs of the three-body system. These s-wave integro-differential Faddeev equations for 
the intrinsic angular momentum 3/2 in the Cartesian coordinates are given by one equation |22, 3 


dx^ 


dy^ 


1 

+ V{x)- ^U{x,y) = dp^U(x',y'), 


(19) 


where 


/ 2 


V3 3/^ 

- xyu H- 

4 2 4 


1/2 


y 




3x2 ^3 

+ T 


1/2 


and p - cos (x,y). The solution of the i-wave Faddeev equation ( fT9l l for the energy of the system above the breakup 
threshold {E > 0) and for the short-range two-body potential V{x) should satisfy the boundary condition lll2|] 


U(x, y) ~ ifi{x) (sin qy + aoiq) exp iqy) + A{0, E) 


exp/ Vsp 
^ ’ 


( 20 ) 


where p - yjx^ -t- y^, tan0 = yjx, as p —> oo. The conditions t/(x, 0) = U{0,y) - 0 guarantee the regularity of 
the solution at zero. The energy and the relative momentum of the neutron, q, are associated with the energy of the 
two-body ground state, defined by the Schrodinger equation 


( 21 ) 


+ y{x)\ip{x) = sip{x). 


by the formula q^ — E — e. The functions a^iq) and A{6, E) to be determined are the binary amplitude and the Faddeev 
component of the breakup amplitude, respectively. The integral representations for these functions are of the form 

i2i 

oo CO 

aoiq) = - f dy sin qy J' dx(p(x)'K(x,y) 

^0 0 

and 


( 22 ) 


A{6,E)- dy sin ( Ve sin 0 y) J" c/x0( VFcosS, x)‘7C(x,y), 


(23) 


17 

















where 0(A:, x) is the scattering two-body wave function 


(p{k, x) sin (kx + 6(k)) 

x-^oo 


and 

1 

= \v(x) f dju^U(x',y'). 
2 J x'y' 

-1 


7.2. Numerical method 

The solution of the BVP with equation ( fT9l l for the ground and excited states as well as for scattering have been 
performed by various authors 1125112611 . The solution method is based on expanding the solution in a basis of the 
Hermite splines H over x and y 


1=0 


t=0 


and allows one to calculate binding energies and Faddeev components. For scattering, the amplitudes are then re¬ 
constructed from the Faddeev components using the integral representations (I22ll2^ . The matrix system is obtained 
after the discretization of the equation (fT9l l at the two-dimensional grid. In addition to the difficult treatment of the 
boundary conditions (l20l) in case of scattering, this approach has a computational disadvantage consisting in the rela¬ 
tively irregular and nonband structure of the matrix of the resulted system of linear equations. The block-tridiagonal 
structure does not arise. 



linear equations with a block-tridiagonal matrix. It is more appropriate for establishing the boundary conditions (l20l i 
as well. Taking into account the change of the unknown function as 'Kip, ff) = ^[pU{x,y), the transformation to the 
hyperspherical (polar) coordinates (p, 6} leads to the following equation: 


52 1 18 ^ 

dp^ 4p2 p2 80^ 


ape) 

+ V(p cos 0) - £ j 'W(p, 0) = y(p cos 0) J %{(p, 0') d0'. (24) 

ape) 


The integration limits are defined, in turn, as 0-(0) = |7r/3 - 0| and 0+(0) = n/2 - \nl6 - 0|. The boundary condition (l20l i 
can be represented as lllbll 


%l{p, 0) ~ 0o(p|f') i^oiqp) + ao(0)'Ho(^p)) + ^ ^k{p\0)ak{E)qdk{ VEp), 

k=\ 


(25) 


where J/o(0 and Elkit) are expressed by 

2 / 0 ( 0 = ,^^//®(0expi(7r/4-H7rk) 

through the Bessel functions Jq, Yq and the Hankel functions of the first kind 0. The unknown Faddeev 
component of the breakup amplitude is expressed in this case as the expansion 

A(0,E) - lim A(0, £',p) = lim ^ ak{E)4>k(pW), 

k=\ 

whereas the binary amplitude is given by a^iq). Here, the functions (^i:(p|0) form the basis generated by the eigenvalue 
problem for operator of the two-body subsystem. The procedure for extraction of the coefhcients of the presented 
expansion is given in Ref. iH. 
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Figure 11: The cubic Hermite splines 01} at the unit interval. 


The two-dimensional BVP defined by Eq. (l24t and boundary condition (l25l l is solved in the hyperspherical coor¬ 
dinates [p, 6} due to proper description of the boundary conditions and appropriate representation of the operator of 
two-body subsystem in Eq. (l24t . Therefore, the computational scheme meets the requirements of a good representa¬ 
tion of the 0-dependent operator. As a result, at Ng intervals over the 0-coordinate the unknown solution is expanded 
in the basis of cubic Hermite splines H{6) 021] 


Ng 

u{0,p)^Yj 

!=0 

whereas over the p-coordinate the finite-difference scheme has been taken. At the unit interval t 6 [0,1], the cubic 
Hermite splines are defined by four formulas 


hoa{t) - 2t^-31^ + 1, 

hio(t) - P - + t, 

hoi{t) — -2p+3fi, 
hn(t) = P-f. 


(26) 


The splines are shown in Eig. [TT] They transferred by the linear transformations to the two consecutive intervals of 
0-grid. Parameterizing these intervals by f e [-1,1], the splines can be written in the following way: 


//‘(f) 

H\t) 


-2p-3t^ + l, f6[-l,0) 
2p-3fi + l, fe[0,1] 

t^+2t^ + t, f€[-l,0) 
t^-2f + t, f6[0,1] ■ 


In order to obtain the appropriate 0-grid, the specially chosen nonequidistant x-grid for operator of the two-body 
subsystem has been used and transformed by the relation 

0,(p) = arccos 0,- € [0,7r/2]. 

V(p) 


19 












Here, the parameter X(p) defines the x-coordinate of the right zero boundary condition for some p. The x-grid consists 
of two parts: the fixed one at small x and the stretchable one at larger x up to X(p). As p increases, X{p) grows and 
the obtained 6>-grid becomes more dense near 7r/2. The quality of the x-grid and consequently of the 0-grid has been 
estimated by a precision of the ground state eigenvalue of the two-body Hamiltonian (12111 . The spline-expansion of 
the solution requires using as many as twice of numbers of coefficients in the expansion. The orthogonal collocation 
method with two Gauss knots within one interval is used for discretization of the studied equation. Over the 
equidistant p-grid with the mesh parameter Ap = p,„ - p,„_i the second partial derivative of the equation (l24l i is 
approximated by the second order finite-difference formula 

■^a4(p,e) ^-—2-. (27) 

dp- {XpY 

This approximation generates the block-tridiagonal structure for the matrix of the linear system, which can be written 
at the grid as follows: 


Ne 

Z 

1=0 


( 07 'o c,(p„,_i) - c,(p,„) + c,(p,„+i) ( 1 . 

(Ap)2 ’’’I p2 d^e 

e+(e<">) 

+ (y(p„,cos0f^) - 7 ^ - - Ay(p„,cos J H^"'(0')dd'^c/(p,„) 




= 0 (28) 


Here, the index (m) denotes the number of arc p„, on which the grid is constructed. As one can see from 

Eq. (l28l l. the resulted system of linear equations has a block-tridiagonal matrix. Moreover, since the inital equation is 
integro-differential, the diagonal blocks of the matrix are dense, whereas the offdiagonal ones are banded. The RHS 
supervector of the system is zero unless the last block-row consisting of terms of the boundary condition (l25l l at the 
last arc of the hyper-radius. 

The proposed numerical method guarantees the precision of the obtained solution of order of {ApY at equidistant 
grid over p and of order of (A0)^ if one designes the equidistant grid over 6. The precision over coordinate p can be 
improved up to (Ap)^ if one applies the Numerov method S for the given problem, that holds the block-tridiagonal 
structure of the matrix. The increase of the accuracy over coordinate 6 can be achieved by employing the quintic 
Hermite splines 

Il50(t) = -6f5 -H 15?"^-lOf^ H- 1, 

h5i(t) = -3Y + SY - +1, 

hs2(t) = -O.SY + 1.5Y - I .5Y +0.5Y, 

hYt) = O.SY -Y + 0.5f^ 

hsAif) = -3Y+lY-At\ 

hssif) = 6f5 - 15f4 H- lOfZ 

The downside of the quintic Hermite splines is that the number of collocation point in this case should be increased 
in more than one and half times comparing to the case of qubic ones. 

7.3. Results of calculations 

The calculations have been carried out for various laboratory frame energies, 4 < Eiab < 42 MeV, using the MTI- 
III potential ll^ for description of the two-body subsystem (l2TTi . For this potential, the achieved value of two-body 
ground state energy is E 2 b - -2.23069 MeV. The nonequidistant 6>-grid with about 500 intervals used for the precise 
calculations. Since the orthogonal collocation method with two Gauss knots within one interval is used, the 
common size for a matrix of the two-body operator (12111 is about 1000. A grid mesh for the uniform p-grid was varied 
from 0.05 fm to 0.01 fm. 

Within the asymptotic approach, the BVPs consisting of the equation (l24li and the boundary condition (l25T l taken 
at the hyper-radius p = p^ax have been solved. The expansion coefficients aifE,pma.x) as functions of p^ax have been 
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Figure 12: Left panel: the prelimiting breakup amplitude A{6, E,pmax^p) for ^lab = 14.1 MeV and p = pmax = 1400 fm. Right panel: the breakup 
amplitude A(0,E,p,flax) ioY Eiah = 14.1 MeV and p„iax = 1400 fm. 


calculated and used for reconstructing of the Faddeev component of the breakup amplitude 


A(e,E,p„ax) = limA(0,E,p„,ax,p) = 'Y ak(E,p„tax) lim ^k(p\S)- 

O-^oo f J p—>oo 


/(=! 


The prelimiting breakup amplitude, A(6,E,p,„ax,p), for Eiab = 14.1 MeV at some finite value of p^ax is shown in 
Fig.[l2](left panel). The breakup amplitude A{0, E,p,„ax) as p oa for the same energy is presented in Fig. [12] (right 
panel). The convergence to the limit is explicitly guaranteed by properties of the functions (l>k(p\0). The limiting forms 
of these functions as p —> oo are explicitly known for 0 e [0,7r/2]; 

2 

(f>kip\d) ~ —psinlkO. 

p^oo ^ 


Therefore, in contrast to the prelimiting case, the smooth behavior of the breakup amplitude near 90 degrees is ob¬ 
served, see the mentioned figures. 

The convergence of the binary amplitude aQ{q,pmax) and breakup amplitude A{0, E,p,„ax) as pmax —> °° has been 
obtained. For example, the p^a,-dependence of the inelasticity coefficient, 77 , and the phase shift, 6, defined as 


ao = 


- 1 
2i ^ 


(29) 


are presented in Fig. [13] The top panel shows that the calculated values of the inelasticity are identical for the three 
different precisions of the computations. The bottom panel shows that the decrease of the mesh step Ap for the p-grid 
to 0.01 fm leads to oscillating but significantly less biased values of the phase shift as pmax increases. The calculations 
using the Numerov method il confirm the finite-difference results. The oscillations are vanishing as p,„ax °° 
and the limiting value of the amplitude can be obtained by extrapolation. Nevertheless, in order to reach relatively 
small oscillations it is necessary to achieve values of p^ax > 1000 fm. The obtained values of the binary amplitude 
for different laboratory frame energies are in a good agreement with the binary amplitudes in Ref. ([35]]. The obtained 
breakup amplitude A(0, E,pmax) was compared with the results in Ref. OSll . This comparison for Eiab = 14.1 MeV is 
presented in Fig. [14] The results are in good agreement for values of 0 < 80°, whereas for 0 ~ 90° the differences are 
observed. 

The typical overall time needed for solution includes the time for construction of the matrix system, for solving it, 
and for calculating the scattering amplitudes. In practical calculations, the first time was comparable with the second 
one. The asymptotic approach requires only the last row of the solution of the matrix system, so the amplitudes are 
calculated relatively fast. For example, the overall time of the solution for the number of blocks on the diagonal 
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Figure 13: The calculated values of the inelasticity coefficient (H Tj (top panel), phase shift 6 (bottom panel) for Eiah = 14.1 MeV as functions 
of pmax- The dashed lines represent the values obtained using the second-order finite-differences |22J at the p-grid with mesh Ap = 0.03 and 
Ap = 0.01, whereas the solid line shows the values obtained using the Numerov method a with Ap = 0.05. The obtained inelasticity coefficients 
coincide with a good precision. 


N - 3000 with the TA is about 11 hours. The application of the DM with P = 64 computing units allowed us to 
reduce the solution time from 11 hours to about one hour, i.e. the overall time reduced by a factor of 10. For larger 
N and for the method using the integral representation, the overall speedup was not so large, but comparable. The 
obtained growth of the performance by a factor of up to ten was practically achieved for various computational studies 
of the given physical problem. 

8. Conclusions 

The DM for parallel solution of the block-tridiagonal matrix systems has been described in detail and the features 
of the method increasing the performance have been given. We have shown that rearranging the initial matrix system 
into equivalent one with the “arrowhead” structure of the new matrix allows one its parallel solving. The new matrix 
consists of diagonal independent blocks, sparse auxiliary ones along the right and bottom sides, and the coupling sup¬ 
plementary block of smaller size. The analytical estimation of the number of multiplicative operations of the method 
and computational speedup with respect to the serial TA has been performed. We studied the standard nonrecursive 
application of the DM as well as the recursive one. In recursive application, the DM is applied to the initial matrix 
system and, then, to the obtained independent subsystems and the supplementary matrix equation. The various cases 
arised in practice have been considered. The maximum computational speedup and the parameters providing the 
achieved speedup have been analytically obtained. The recursive application of the method allowed us to consider¬ 
ably increase the speedup in comparison to the standard nonrecursive use of the DM. For the considered cases, the 
achieved analytical speedup with respect to the TA was varied by a factor up to fifty. The analytical estimations for the 
standard nonrecursive case have been validated in practical calculations of solving the matrix system originated from 
the BVP for the two-dimensional integro-differential Faddeev equations. The numerical scheme for solution of these 
equations as well as the illustrative results have been presented. The overall growth of the performance by a factor of 
up to ten has been practically achieved in the computations. 
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Figure 14: Real and imaginary parts of the breakup amplitude A(6,E,p,„„x) for = 14.1 MeV. The results of Ref. are also shown for 
comparison. 
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